Practitioner's Docket No. 3193/102 



PATENT 



IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 

In re application of: Vladimir Miloushev, Peter Nickolov 

Application No.: 10/043,413 Group No.: 2142 

Filed: 01/10/2002 Examiner: Donaghue, Larry D. 

For: File Switch and Switched File System 

Mail Stop Issue Fee 
Commissioner for Patents 
P.O. Box 1450 
Alexandria, VA 22313-1450 

PETITION TO WITHDRAW HOLDING OF ABANDONMENT 
BASED ON EVIDENCE THAT A REPLY WAS TIMELY MAILED OR FILED 

1. I hereby petition to withdraw the holding of abandonment in this case, on the basis that a reply to 
the Notice of Allowance of 8/16/07 was timely filed. 

2. I hereby state: 

(a) A Request for Continued Examination, including a supplemental Information Disclosure 
Statement, was filed on October 30, 2007 by first class mail. 

(b) The date-stamped return postcard from the PTO was not received. 

(c) As of today's date, there is no indication on PAIR that the RCE was received by the PTO. 

3. Attached is a copy of the RCE and IDS as filed on October 30, 2007. Please note that our file 
copy of the submission includes only the first page of each submitted reference. Complete copies 
of the references will be submitted to the PTO under separate cover. 

4. As the person who signed the certificate of mailing, I hereby attest to personally placing the RCE 
with supplemental IDS and references in an envelope (box) addressed to Mail Stop RCE, 
Commissioner for Patents, P.O. Box 1450, Alexandria, VA 22313-1450 and handing the 
envelope to our firm's mail clerk, Colin Hoyle, who, as I recall, remained at the office after 
normal business hours specifically for the purpose of mailing the envelope, and, to the best of my 
knowledge and belief, did in fact hand-deliver the envelope to the U.S. Post Office at Boston's 
South Station on October 30, 2007 with sufficient postage as first class mail. 

5. In consideration of these submissions, it is respectfully requested that the holding of abandonment 
be withdrawn. 

6. I believe that no fee is required for this petition. However, please charge Deposit Account 19- 
4972 for any fees that are required by this paper. 

Date: December 21, 2007 

Signature of Practitioner 



Jeffrey T. Klayman 
Reg. No. 39,250 
Bromberg & Sunstein LLP 
125 Summer Street 
Boston, MA 02110 

PTO/SB/30 (10-07) 
Approved for use through 10/31/2007. OMB 0651-0031 
U.S. Patent and Trademark Office; U.S. DEPARTMENT OF COMMERCE 
Under the Paperwork Reduction Act of 1995, no persons are required to respond to a collection of information unless it contains a valid OMB control number. 

Request for Continued Examination (RCE) Transmittal 

Address to: 
Mail Stop RCE 
Commissioner for Patents 
P.O. Box 1450 
Alexandria, VA 22313-1450 



Application Number: 10/043,413 
Filing Date: January 10, 2002 
First Named Inventor: Vladimir Miloushev 
Examiner Name: 
Attorney Docket Number: 



This is a Request for Continued Examination (RCE) under 37 CFR 1.114 of the above-identified application. 

Request for Continued Examination (RCE) practice under 37 CFR 1.114 does not apply to any utility or plant application filed prior to June 8, 
1995, or to any design application. See Instruction Sheet for RCEs (not to be submitted to the USPTO) on page 2. 



[Submission required under 37 CFR 1.114] Note: If the RCE is proper, any previously filed unentered amendments and 
amendments enclosed with the RCE will be entered in the order in which they were filed unless applicant instructs otherwise. If 
applicant does not wish to have any previously filed unentered amendment(s) entered, applicant must request non-entry of such 
amendment(s). 

[ ] Consider the arguments in the Appeal Brief or Reply Brief previously filed on 
[ ] Other 

[X] Enclosed 
i. [ ] Amendment/Reply 
ii. [ ] Affidavit(s)/Declaration(s) 
iii. [X] Information Disclosure Statement (IDS) 
iv. [ ] Other 



Suspension of action on the above-identified application is requested under 37 CFR 1.103(c) for a 
period of months. (Period of suspension shall not exceed 3 months; Fee under 37 CFR 1.17(i) required) 

Other 



[Fees] The RCE fee under 37 CFR 1.17(e) is required by 37 CFR 1.114 when the RCE is filed. 

The Director is hereby authorized to charge the following fees, any underpayment of fees, or credit any overpayments, to 

a. [X] Deposit Account No. 19-4972. I have enclosed a duplicate copy of this sheet. 

i. [X] RCE fee required under 37 CFR 1.17(e) 

ii. [ ] Extension of time fee (37 CFR 1.136 and 1.17) 

iii. [ ] Other 

b. [ ] Check in the amount of $ enclosed 

c. [ ] Payment by credit card (Form PTO-2038 enclosed) 



SIGNATURE OF APPLICANT, ATTORNEY, OR AGENT REQUIRED 

Signature 

Name (Print/Type): Jeffrey T. Klayman 

Date: October 30, 2007 

CERTIFICATE OF MAILING OR TRANSMISSION 

I hereby certify that this correspondence is being deposited with the United States Postal Service with sufficient postage as 
first class mail in an envelope addressed to: Mail Stop RCE, Commissioner for Patents, P. O. Box 1450, Alexandria, VA 22313-1450 
or facsimile transmitted to the U.S. Patent and Trademark Office on the date shown below. 

Signature 

Date: October 30, 2007 

Name (Print/Type): Jeffrey T. Klayman 



This collection of information is required by 37 CFR 1.114. The information is required to obtain or retain a benefit by the public which is to file (and by the USPTO 
to process) an application. Confidentiality is governed by 35 U.S.C. 122 and 37 CFR 1.11 and 1.14. This collection is estimated to take 12 minutes to complete, 
including gathering, preparing, and submitting the completed application form to the USPTO. Time will vary depending upon the individual case. Any comments on 
the amount of time you require to complete this form and/or suggestions for reducing this burden, should be sent to the Chief Information Officer, U.S. Patent and 
Trademark Office, U.S. Department of Commerce, P.O. Box 1450, Alexandria, VA 22313-1450. DO NOT SEND FEES OR COMPLETED FORMS TO THIS 
ADDRESS. SEND TO: Mail Stop RCE, Commissioner for Patents, P.O. Box 1450, Alexandria, VA 22313-1450. 

If you need assistance in completing the form, call 1-800-PTO-9199 and select option 2. 



DUPLICATE 



Approved for use through 10/31/2007. OMB 0651-0031 
U.S. Patent and Trademark Office; U.S. DEPARTMENT OF COMMERCE 
Under the Paperwork Reduction Act of 1995, no persons are required to respond to a collection of information unless it contains a valid OMB control number. 

Request for Continued Examination (RCE) Transmittal 

Address to: 
Mail Stop RCE 
Commissioner for Patents 
P.O. Box 1450 
Alexandria, VA 22313-1450 

Application Number: 10/043,413 
Filing Date: January 10, 2002 
First Named Inventor: Vladimir Miloushev 
Examiner Name: 
Attorney Docket Number: 



This is a Request for Continued Examination (RCE) under 37 CFR 1.114 of the above-identified application. 
Request for Continued Examination (RCE) practice under 37 CFR 1.114 does not apply to any utility or plant application filed prior to June 8, 
1995, or to any design application. See Instruction Sheet for RCEs (not to be submitted to the USPTO) on page 2. 



[Submission required under 37 CFR 1.114] Note: If the RCE is proper, any previously filed unentered amendments and 
amendments enclosed with the RCE will be entered in the order in which they were filed unless applicant instructs otherwise. If 
applicant does not wish to have any previously filed unentered amendment(s) entered, applicant must request non-entry of such 
amendment(s). 

[ ] Consider the arguments in the Appeal Brief or Reply Brief previously filed on 
[ ] Other 

[X] Enclosed 
i. [ ] Amendment/Reply 
ii. [ ] Affidavit(s)/Declaration(s) 
iii. [X] Information Disclosure Statement (IDS) 
iv. [ ] Other 



I have enclosed a duplicate copy of this sheet. 



2. [Miscellaneous] 

a. [ ] Suspension of action on the above-identified application is requested under 37 CFR 1.103(c) for a 
period of months. (Period of suspension shall not exceed 3 months; Fee under 37 CFR 1.17(i) required) 
b. Other 

3. [Fees] The RCE fee under 37 CFR 1.17(e) is required by 37 CFR 1.114 when the RCE is filed. 

The Director is hereby authorized to charge the following fees, any underpayment of fees, or credit any overpayments, to 
a. [X] Deposit Account No. 19-4972 

i. [X] RCE fee required under 37 CFR 1.17(e) 

ii. [ ] Extension of time fee (37 CFR 1.136 and 1.17) 

iii. [ ] Other 

b. [ ] Check in the amount of $ 

c. [ ] Payment by credit card (Form PTO-2038 enclosed) 

WARNING: Information on this form may become public. Credit card information should not be included on this form. Provide credit 
card information and authorization on PTO-2038. 



SIGNATURE OF APPLICANT, ATTORNEY, OR AGENT REQUIRED 

CERTIFICATE OF MAILING OR TRANSMISSION 

I hereby certify that this correspondence is being deposited with the United States Postal Service with sufficient postage as 
first class mail in an envelope addressed to: Mail Stop RCE, Commissioner for Patents, P. O. Box 1450, Alexandria, VA 22313-1450 
or facsimile transmitted to the U.S. Patent and Trademark Office on the date shown below. 

Name (Print/Type): Jeffrey T. Klayman 

Date: October 30, 2007 



This collection of information is required by 37 CFR 1.114. The information is required to obtain or retain a benefit by the public which is to file (and by the USPTO 
to process) an application. Confidentiality is governed by 35 U.S.C. 122 and 37 CFR 1.11 and 1.14. This collection is estimated to take 12 minutes to complete, 
including gathering, preparing, and submitting the completed application form to the USPTO. Time will vary depending upon the individual case. Any comments on 
the amount of time you require to complete this form and/or suggestions for reducing this burden, should be sent to the Chief Information Officer, U.S. Patent and 
Trademark Office, U.S. Department of Commerce, P.O. Box 1450, Alexandria, VA 22313-1450. DO NOT SEND FEES OR COMPLETED FORMS TO THIS 
ADDRESS. SEND TO: Mail Stop RCE, Commissioner for Patents, P.O. Box 1450, Alexandria, VA 22313-1450. 

If you need assistance in completing the form, call 1-800-PTO-9199 and select option 2. 

IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 

In re application of: Miloushev et al. Group No.: 2142 

Application Number: 10/043,413 Examiner: Prieto, Beatriz 

Filing Date: January 10, 2002 
Title: File Switch and Switched File System 

Mail Stop RCE 
Commissioner for Patents 
P.O. Box 1450 
Washington, DC. 20231 

SUPPLEMENTAL INFORMATION DISCLOSURE STATEMENT 

NOTE: "An information disclosure statement shall be considered by the Office if filed by the applicant: 

(1) Within three months of the filing date of a national application; 

(2) Within three months of the date of entry of the national stage as set forth in section 1.491 in an international 
application; or 

(3) Before the mailing date of a first Office action on the merits, whichever event occurs last." 37 C.F.R. section 
1.97(b). 

CERTIFICATION UNDER 37 C.F.R. SECTIONS 1.8(a) and 1.10* 

(When using Express Mail, the Express Mail label number is mandatory; 
Express Mail certification is optional.) 

I hereby certify that, on the date shown below, this correspondence is being: 

MAILING 

[X] deposited with the United States Postal Service in an envelope addressed to the Commissioner for Patents, P.O. Box 1450, 
Alexandria, VA 22313-1450. 

37 C.F.R. SECTION 1.8(a) 

[X] with sufficient postage as first class mail. 

37 C.F.R. SECTION 1.10* 

[ ] as "Express Mail Post Office to Addressee" 

Mailing Label No. (mandatory) 

TRANSMISSION 

[ ] transmitted by facsimile to the Patent and Trademark Office. 

Signature 

Date: October 30, 2007 Jeffrey T. Klayman 

(type or print name of person certifying) 

*WARNING: Each paper or fee filed by "Express Mail" must have the number of the "Express Mail" mailing label placed 
thereon prior to mailing. 37 C.F.R. section 1.10(b). 

"Since the filing of correspondence under section 1.10 without the Express Mail mailing label thereon is an 
oversight that can be avoided by the exercise of reasonable care, requests for waiver of this requirement will 
not be granted on petition." Notice of Oct. 24, 1996, 60 Fed. Reg. 56,439, at 56,442. 



NOTE: "Each individual associated with the filing and prosecution of a patent application has a duty of candor and good faith 
in dealing with the Office, which includes a duty to disclose to the Office all information known to that individual to be 
material to patentability as defined in this section." 37 C.F.R. section 1.56(a). 

"Individuals associated with the filing or prosecution of a patent application within the meaning of this section are: 

(1) each inventor named in the application; 

(2) each attorney or agent who prepares or prosecutes the application; and 

(3) every other person who is substantively involved in the preparation or prosecution of the application and who is 
associated with the inventor, with the assignee or with anyone to whom there is an obligation to assign the application." 
37 C.F.R. section 1.56(c). 

NOTE: The "duty as described in section 1.56 will be met so long as the information in question was cited by the Office or 
submitted to the Office in the manner prescribed by sections 1.97(b)-(d) and 1.98 before issuance of the patent." Notice 
of January 9, 1992, 1135 O.G. 13-25 at 17. 

WARNING: "No information disclosure statement may be filed in a provisional application." 37 C.F.R. section 1.51(b). 

List of Sections Forming Part of This Information Disclosure Statement 

The following sections are being submitted for this Information Disclosure Statement: 

(check sections forming a part of this statement: discard unused sections and number pages consecutively) 

1. [x] Preliminary Statements 

2. [x] Forms PTO/SB/08A and 08B (substitute for Form PTO-1449) 

3. [x] Statement as to Information Not Found in Patents or Publications 

4. [ ] Identification of Prior Application in Which Listed Information Was Already Cited and for 
Which No Copies Are Submitted or Need Be Submitted 

5. [ ] Cumulative Patents or Publications 

6. [x] Copies of Listed Information Items Accompanying This Statement 

7. [ ] Concise Explanation of Non-English Language Listed Information Items 

7A. [ ] EPO Search Report 

7B. [ ] English Language Version of EPO Search Report 

8. [ ] Translation(s) of Non-English Language Documents 

9. [x] Concise Explanation of English Language Listed Information Items (Optional) 

10. [x] Identification of Person(s) Making This Information Disclosure Statement 

(complete the following, if appropriate) 
Sections , respectively, have been continued on ADDED PAGE(S). 

NOTE: "Once the minimum requirements are met, the examiner has an obligation to consider the information." Notice of April 
20, 1992 (1138 O.G. 37-41, 37). 

Section 1. Preliminary statements 

Applicants submit herewith patents, publications or other information, of which they are aware that they 
believe may be material to the examination of this application, and in respect of which, there may be a duty 
to disclose. 

The filing of this information disclosure statement shall not be construed as a representation that a 
search has been made (37 C.F.R. section 1.97(g)), an admission that the information cited is, or is 
considered to be, material to patentability, or that no other material information exists. 

The filing of this information disclosure statement shall not be construed as an admission against 
interest in any manner. Notice of January 9, 1992, 1135 O.G. 13-25, at 25. 

SECTION 2: FORMS PTO/SB/08A and 08B (formerly Form PTO-1449) 

IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 

In re application of: Miloushev et al. Group No.: 2142 

Application Number: 10/043,413 Examiner: Prieto, Beatriz 

Filing Date: January 10, 2002 
Title: File Switch and Switched File System 



LIST OF PATENTS AND PUBLICATIONS FOR 
APPLICANT'S INFORMATION DISCLOSURE STATEMENT 
U.S. Patent Documents 



Examiner   Highlight   Ref.   Patent    Issue Date   Inventor             Class/Subclass
Initials               Num.
                       AA     6029168   02/22/00     [illegible]          707/10
                       AB     5897638   04/27/99     Lasser et al.        707/102
                       AC     5917998   06/29/99     Cabrera, et al.      [illegible]
                              5583995   12/10/96     [illegible]          [illegible]
                              5473362   [illegible]  Fitzgerald           [illegible]
                              6775679   08/10/04     Gupta, Uday          [illegible]
                              4993030   02/12/91     [illegible]          371/40
                              6397246   05/28/02     Wolfe, Daniel        709/217
                              6047129   04/4/00      [illegible]          395/712
           *                  6985936   01/10/06     [illegible]          709/221
                              6044367   03/28/00     Wolff, James         707/2
                              5649200   07/15/97     Leblang, et al.      395/703
                              6516351   02/4/03      Borr, Andrea         709/229
                              6393581   05/21/02     Friedman, et al.     714/4
                              6223206   04/21/01     Dan, et al.          709/105
                              6161145   12/12/00     Bainbridge, et al.   709/246
                              6339785   01/15/02     Feigenbaum, Idan     709/213
                              6324581   11/27/01     Xu, et al.           709/229
                              6438595   08/20/02     Blumenau, et al.     709/226
                              6721794   04/13/04     Taylor               709/231
                              6516350   02/14/03     Lumelsky             709/226
                              5548724   08/20/96     Akizawa et al.       395/200.03
           *                  6782450   08/24/04     Arnott et al.        711/114
                              7072917   07/4/06      Wong et al.          707/205
                              6233648   05/2001      Tomita, Haruo        711/4
                              6556998   04/2003      Mukherjee et al.     707/10
                              6985956   01/2006      Luke et al.          709/229
                              6990547   01/2006      Ulrich et al.        710/304
                              6990667   01/2006      Ulrich et al.        718/105

IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 



In re application of: Miloushev et al. Group No.: 2142 

Application Number: 10/043,413 Examiner: Prieto, Beatriz 

Filing Date: January 10, 2002 
Title: File Switch and Switched File System 

LIST OF PATENTS AND PUBLICATIONS FOR 
APPLICANT'S INFORMATION DISCLOSURE STATEMENT 



U.S. Published Patent Applications 



Examiner   Highlight   Ref.   Patent         Issue Date   Inventor             Class/Subclass
Initials               Num.
                              2004/0025013   02/5/04      Parker, et al.       713/163
                              2005/0021615   01/27/05     Arnott, et al.       709/203
                              2004/0028043   02/12/04     Maveli, et al.       370/392
                              2004/0030857   02/12/04     Krakirian, et al.    711/206
                              2004/0028063   02/12/04     Roy, et al.          370/402
                              2002/0120763   08/2002      Miloushev et al.     709/230
                              2002/0161911   10/2002      Pinckney et al.      709/231
                              2003/0028514   02/2003      Lord et al.          707/1
                              2003/0033308   02/2003      Patel et al.         707/10
                              2003/0135514   07/2003      Patel et al.         707/102
                              2004/0006575   01/2004      Visharam et al.      707/104.1
                              2004/0098383   05/2004      Tabellion et al.     707/003
                              2004/0133607   07/2004      Miloushev et al.     707/200
                              2004/0133577   07/2004      Miloushev et al.     707/010
                              2004/0236798   11/2004      Srinivasan et al.    707/200

IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 

In re application of: Miloushev et al. Group No.: 2142 

Application Number: 10/043,413 Examiner: Prieto, Beatriz 
Filing Date: January 10, 2002 
Title: File Switch and Switched File System 

LIST OF PATENTS AND PUBLICATIONS FOR 
APPLICANT'S INFORMATION DISCLOSURE STATEMENT 



Non-Patent Documents 



Examiner   Highlight   Ref.   Author, Title of Article, Title of Journal, Volume Number,
Initials               Num.   Page Numbers, Date

                       AG     Callaghan et al., "NFS Version 3 Protocol Specifications" (RFC 1813), 1995,
                              [illegible], 12/30/02.

                              Norton et al., "CIFS Protocol Version CIFS-Spec 0.9", 2001, Storage
                              Networking Industry Association (SNIA), www.snia.org, last accessed on
                              3/26/01.

                       AI     Stakutis, C., "Benefits of SAN-based file system sharing", Jul. 2000,
                              InfoStor, www.infostor.com, last accessed on 12/30/02.

                       AJ     Haskin et al., "The Tiger Shark File System", 1996, in proceedings of IEEE,
                              Spring COMPCON, Santa Clara, CA, www.research.ibm.com, last accessed
                              on 12/30/02.

                       AK     Peterson, M., "Introducing Storage Area Networks", Feb 1998, InfoStor,
                              www.infostor.com, last accessed on 12/20/02.

                       AL     Patterson et al., "A case for redundant arrays of inexpensive disks (RAID)",
                              Chicago, Illinois, June 1-3, 1998, in proceedings of ACM SIGMOD
                              conference on the Management of Data, pp. 109-116, Association for
                              Computing Machinery, Inc., www.acm.org, last accessed on 12/20/02.

                       AM     Farley, M., "Building Storage Networks", January 2000, McGraw Hill, ISBN
                              0072120509.

                       AN     "Auspex Storage Architecture Guide", Second Edition, 2001, Auspex
                              Systems, Inc., www.auspex.com, last accessed on 12/30/02.

                       AO     "Windows Clustering Technologies-An Overview", November 2000,
                              Microsoft Corp., www.microsoft.com, last accessed on 12/30/02.

                       AP     Steven Soltis, et al., "The Global File System", in Proceedings of the Fifth
                              NASA Goddard Space Flight Center Conference on Mass Storage Systems
                              and Technologies, September 17-19, 1996, College Park, Maryland.

                       AQ     Preslan et al., "Scalability and Failure Recovery in a Linux Cluster File
                              System", in Proceedings of the 4th Annual Linux Showcase & Conference,
                              Atlanta, Georgia, October 10-14, 2000, www.usenix.org, last accessed on
                              12/20/02.

                       AR     Zayas, E., "AFS-3 Programmer's Reference: Architectural Overview",
                              Transarc Corp., version 1.0 of September 2, 1991, doc. number FS-00-D160.

Application Number: 10/043,413 Examiner: Prieto, Beatriz 

Filing Date: January 10, 2002 
Title: File Switch and Switched File System 

LIST OF PATENTS AND PUBLICATIONS FOR 
APPLICANT'S INFORMATION DISCLOSURE STATEMENT 



Non-Patent Documents (Continued) 



Examiner   Highlight   Ref.   Author, Title of Article, Title of Journal, Volume Number,
Initials               Num.   Page Numbers, Date

                              Transarc Corp., www.transarc.ibm.com, last accessed on 12/20/02.

                       AT     "VERITAS SANPoint Foundation Suite(tm) and SANPoint
                              Foundation(tm) Suite HA: New VERITAS Volume Management and File
                              System Technology for Cluster Environments", September 2001,
                              VERITAS Software Corp.

                       AU     "Distributed File System: Logical View of Physical Storage: White Paper",
                              1999, Microsoft Corp., www.microsoft.com, last accessed on 12/20/02.

                              Anderson et al., "Serverless Network File System", in the 15th Symposium
                              on Operating Systems Principles, December 1995, Association for
                              Computing Machinery, Inc.

                              Gibson et al., "NASD Scalable Storage Systems", June 1999, USENIX99,
                              Extreme Linux Workshop, Monterey, California.

                       AX     Gibson et al., "File Server Scaling with Network-Attached Secure Disks",
                              in Proceedings of the ACM International Conference on Measurement and
                              Modeling of Computer Systems (Sigmetrics '97), 1997, Association for
                              Computing Machinery, Inc.

                       AY     Cabrera et al., "Swift: Storage Architecture for Large Objects", in
                              Proceedings of the Eleventh IEEE Symposium on Mass Storage Systems,
                              pages 123-128, Oct 1991.

                       BA     Cabrera et al., "Using Data Striping in a Local Area Network", 1992,
                              technical report number UCSC-CRL-92-09 of the Computer & Information
                              Sciences Department of University of California at Santa Cruz.

                       BB     Long et al., "Swift/RAID: A distributed RAID System", Computing
                              Systems, vol. 7, pp. 333-359, Summer 1994.

           *           BC     Hartman, J., "The Zebra Striped Network File System", 1994, Ph.D.
                              dissertation submitted in the Graduate Division of the University of
                              California Berkeley.

                       BD     Carns et al., "PVFS: A Parallel File System For Linux Clusters", in
                              Proceedings of the 4th Annual Linux Showcase and Conference, pages 317-327,
                              Atlanta, Georgia, October 2000, USENIX Association.

           *           BF     "NERSC Tutorials: I/O on the Cray T3E", chapter 8, "Disk Striping",
                              National Energy Research Scientific Computing Center (NERSC),
                              http://hpcf.nersc.gov, last accessed on 12/27/02.

Filing Date: January 10, 2002 

LIST OF PATENTS AND PUBLICATIONS FOR 
APPLICANT'S INFORMATION DISCLOSURE STATEMENT 



Non-Patent Documents (Continued) 



Examiner   Highlight   Ref.   Author, Title of Article, Title of Journal, Volume Number,
Initials               Num.   Page Numbers, Date

                       BH     Thekkath et al., "Frangipani: A Scalable Distributed File System", in
                              Proceedings of the 16th ACM Symposium on Operating Systems
                              Principles, October 1997, Association for Computing Machinery, Inc.

                       BI     Hwang et al., "Designing SSI Clusters with Hierarchical Checkpointing
                              and Single I/O Space", IEEE Concurrency, pp. 60-69, Jan-Mar 1999.

                              Cavale, M. R., "Introducing Microsoft Cluster Service (MSCS) in the
                              Windows Server 2003," Microsoft Corporation, November 2002.

                              Pearson, P.K., "Fast Hashing of Variable-Length Text Strings", Comm.
                              of the ACM, Vol. 33, No. 6, June 1990.

                              Sorenson, K.M., "Installation and Administration: Kimberlite Cluster
                              Version 1.1.0, Rev. D." Mission Critical Linux,
                              http://oss.missioncriticallinux.com/kimberlite/kimberlite.pdf

                              Wilkes, J., et al., "The HP AutoRAID Hierarchical Storage System,"
                              ACM Transactions on Computer Systems, Vol. 14, No. 1, February 1996.

                              Savage, et al., "AFRAID- A Frequently Redundant Array of Inexpensive 

would like the Applicant to supply copies of any or all of the information included in any of these 

NOTE: 37 C.F.R. section 1.98(a)(2) requires that any information disclosure statement filed under section 1.97 shall include: 
"A legible copy of: (i) Each U.S. and foreign patent; (ii) Each publication or that portion which caused it to be listed; 

references. In an attempt to facilitate the Examiner's review of the references, Applicants' attorney would 
direct the Examiner's attention to the highlighted references, which, out of the cited references, appear to be 
more closely related to the subject patent application. 

Submission of any particular reference is not an admission that the reference is material to patentability or 
qualifies as prior art to one or more of the claims. 

NFS Version 3 Protocol Specification 

Status of this Memo 

This memo provides information for the Internet community. 
This memo does not specify an Internet standard of any kind. 
Distribution of this memo is unlimited. 

IESG Note 

Internet Engineering Steering Group comment: please note that 
the IETF is not involved in creating or maintaining this 
specification. This is the significance of the specification 
not being on the standards track. 

Abstract 

This paper describes the NFS version 3 protocol. This paper is 
provided so that people can write compatible implementations. 

Introducing Storage Area Networks 

Here's a primer to get you started in understanding the benefits of SAN, SAS, and NAS architectures. 
By Michael Peterson 

As IT organizations re-engineer distributed networks to achieve continuous operations and to host 
mission-critical applications, they are increasingly considering an architecture that is common in data 
centers. Mainframe-based data centers use a network storage interface called ESCON to connect 
mainframes to multiple storage systems and distributed networks, an architecture that is referred to as a 
storage-area network (SAN). 

In a typical data center, SANs account for approximately 25% of all network traffic. What's new is 
that SAN architectures are now being adopted in distributed networks using low-cost interconnect 
technologies such as SCSI, SSA, and Fibre Channel. 

A SAN is a high-speed network, similar to a LAN, that establishes a direct connection between storage 
elements and servers or clients. The SAN is an extended storage bus that can be interconnected using 
technologies used in LANs and WANs, such as routers, hubs, switches, and gateways. A SAN can be 
local or remote, shared or dedicated, and it uniquely includes externalized and central storage. SAN 
interfaces are usually ESCON, SCSI, SSA, Fibre Channel, or HIPPI, rather than Ethernet. 

It doesn't matter whether a SAN is called a storage-area network or system-area network because the 
architecture is the same. Either way, SANs create a method of attaching storage that is revolutionizing 
networks, resulting in significant improvements in availability and performance. 

SANs are currently used to connect shared-storage arrays, to cluster servers for failover, to 
interconnect mainframe disk or tape resources to distributed network servers and clients, and to create 
parallel or [remainder illegible] 

In essence, a SAN is nothing more than another network, like a subnet, except that it's implemented 
with storage interfaces. SANs enable storage to be externalized from the server, allowing storage 
devices to be shared among multiple host servers without affecting system performance of the primary 
network. 

SANs are not new. The benefits are well proven because the architecture evolved from mainframe 
DASD. In fact, Digital Equipment's VAX/VMS network environment is based on a SAN architecture 
and clustered servers. And vendors such as EMC already have a large installed base of SAN-attached 
arrays. 

Book Review - Building Storage Networks, 2nd Edition 

By Enterprise Storage Forum Staff 

Building Storage Networks (2nd Ed) by Marc Farley is nothing if not complete. At almost 600 pages [...] 
has the capacity to deliver a wide reaching range of material on storage network which it duly does. 

Farley, who also wrote the first edition of the book, makes it clear in the introduction that much has 
changed both in the storage industry, and since the book's first edition. He also explains that while [...] 
has included information about emerging technologies such as InfiniBand, there is plenty that can 
change as these nascent technologies develop. Farley is to be commended in this respect, as it would 
have been easy for him to skimp on coverage of such topics. Instead, he has obviously expended a [...] 
deal of time and effort in researching new technologies and including the information in the book. 

The seventeen (yes, that's one seven) chapters of the book take the reader on a progressive journey 
through almost every aspect of networked storage. There are five sections within the book which [...] 
a natural progression - Introduction to Network Storage, Fundamental Storing Applications, The 
Storage Channel Becomes a Network, Wiring Technologies and Filing, Internet Storage, and 
Management. 

Perhaps the most impressive aspect of the book is the sheer detail used to explain the underlying 
principles of network storage before progressing to the more common coverage of SAN's, NAS and 
Fibre Channel. In fact, by the time the detailed discussion on SAN's and NAS is reached, the reader [...] 
have already covered RAID, caching, backup, I/O channels, mirroring and replication, and network 
backup. Detailed coverage on SANs doesn't start until page 257 by which time the reader is in a [...] 
position to better understand why SAN's are used and of the underlying technologies. 

In the middle of the book an 8 page 'blueprint' section runs down, in graphical form, some of the 
Typical Implementations of Filing, Storing, and Wiring in Host Systems and Storage Subsystems. 
Figures abound throughout the book and do an excellent job of reinforcing the principles described in 
the text. A large number of them make use of flow-chart type shapes whereas others, when appropriate, 
use representations of servers, hubs and other network storage paraphernalia. 

One slightly odd inclusion is that of exercises at the end of each section, which incite the reader to 
perform tasks like "Diagram the complete process and I/O path used in retrieving data from virtual 
memory on disk." While such an exercise might be a good reinforcement of the information presented 
in the chapter, these tasks seem more suited to a certification study guide than a reference book. 
Abstract 

This article is written for IT managers and examines the cluster technologies available on the 
Microsoft® Windows® server operating system. Also discussed is how cluster technologies can be 
architected to create comprehensive, mission-critical solutions that meet the requirements of the 
enterprise. 
Abstract 

The Global File System (GFS) is a prototype design for a distributed file system in which 
cluster nodes physically share storage devices connected via a network like Fibre 
Channel. Networks and network-attached storage devices have advanced to a level of 
performance and extensibility so that the previous disadvantages of shared disk 
architectures are no longer valid. This shared storage architecture attempts to exploit the 
sophistication of storage device technologies whereas a server architecture diminishes a 
device's role to that of a simple component. GFS distributes the file system 
responsibilities across processing nodes, storage across the devices, and file system 
resources across the entire storage pool. GFS caches data on the storage devices instead 
of the main memories of the machines. Consistency is established by using a locking 
mechanism maintained by the storage devices to facilitate atomic read-modify-write 
operations. The locking mechanism is being prototyped on Seagate disk drives and 
Ciprico disk arrays. GFS is implemented in the Silicon Graphics IRIX operating system 
and is accessed using standard Unix commands and utilities. 
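
The device-maintained locking described in the abstract can be illustrated with a small model. The following is only a sketch under stated assumptions: the class and function names (DeviceLock, SharedDevice, read_modify_write) are hypothetical, the storage device is simulated in memory, and a Python mutex stands in for the device's ability to execute one lock command atomically. It is not the GFS implementation.

```python
import threading
import time


class DeviceLock:
    """Hypothetical lock whose state lives on the storage device, not on a node."""

    def __init__(self):
        self._owner = None
        # Stand-in for the device executing one lock command at a time.
        self._atomic = threading.Lock()

    def test_and_set(self, node_id):
        """Return True if the lock was free and is now held by node_id."""
        with self._atomic:
            if self._owner is None:
                self._owner = node_id
                return True
            return False

    def clear(self, node_id):
        """Release the lock if node_id holds it."""
        with self._atomic:
            if self._owner == node_id:
                self._owner = None


class SharedDevice:
    """Simulated shared disk: one integer per block, one device lock per block."""

    def __init__(self, n_blocks):
        self.blocks = [0] * n_blocks
        self.locks = [DeviceLock() for _ in range(n_blocks)]


def read_modify_write(device, block, node_id, update):
    """Apply update() to a block while holding that block's device lock."""
    lock = device.locks[block]
    while not lock.test_and_set(node_id):
        time.sleep(0)  # yield and retry; a real node would back off
    try:
        device.blocks[block] = update(device.blocks[block])
    finally:
        lock.clear(node_id)
```

Because every node must win test_and_set before touching a block, concurrent updates from several nodes are serialized by the device rather than by a central server.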



Introduction 

Distributed systems can be evaluated by three factors: performance, availability, and 
extensibility. Performance can be characterized by such measurements as response time 
and throughput. Distributed systems can achieve availability by allowing their working 
components to act as replacements for failed components. Extensibility is a combination 
of portability and scalability. Obvious influences on scalability are such things as 
addressing limitations and network ports, but subtle bottlenecks in hardware and software 
may also arise. 



These three factors are influenced by the architecture of the distributed and parallel 
systems. The architectures can be categorized as message-based (shared nothing) and 



1 This work was supported by the Office of Naval Research under grant no. N00019-95-1-0611, by the 
National Science Foundation under grant ASC-9523480, and by grant no. 5555-23 from the University 
Space Research Association which is administered by NASA's Center for Excellence in Space Data and 
Information Sciences (CESDIS) at the NASA Goddard Space Flight Center. 






4th Annual Linux Showcase 
& Conference, Atlanta 



Pp. 169-180 of the Proceedings 

Scalability and Failure Recovery in a Linux Cluster File System 

Kenneth W. Preslan, Andrew Barry, Jonathan Brassow, 
Michael Declerck, A J. Lewis, Adam Manthei, 
Ben Marzinski, Erling Nygaard, Seth Van Oort, 
David Teigland, Mike Tilstra, Steven Whitehouse, 
and Matthew O'Keefe 



Sistina Software, Inc. 
1313 5th St. S.E. 
Abstract: 

In this paper we describe how we implemented journaling and recovery in the Global File System 
(GFS), a shared-disk, cluster file system for Linux. We also present our latest performance results for a 
16-way Linux cluster. 

Introduction 

Traditional local file systems support a persistent name space by creating a mapping between blocks 
found on disk drives and a set of files, file names, and directories. These file systems view devices as 
local: devices are not shared so there is no need in the file system to enforce device sharing semantics. 
Instead, the focus is on aggressively caching and aggregating file system operations to improve 
performance by reducing the number of actual disk accesses required for each file system operation 

m. 
The AFS File System 
In Distributed Computing Environments 

Introduction 

The AFS 3 distributed file system targets the issues critical to distributed computing 
environments. AFS performs exceptionally well, both within small, local work groups 
of machines and across wide-area configurations in support of large, collaborative 
efforts. AFS provides an architecture geared towards system management, along 
with the tools to perform important management tasks. For a user, AFS is a familiar 
yet extensive UNIX environment for accessing files easily and quickly. 

AFS 3 is currently in use at hundreds of sites worldwide, with the number of AFS 
sites continuing to grow at a robust rate. Looking to the future, the Open Software 
Foundation's (OSF) Distributed Computing Environment (DCE) includes a 
Distributed File Service (DFS) based on AFS 3. Along with the DCE, the DFS will 
be provided to a multitude of new sites by way of major hardware vendors and 
other third party developers (including Transarc). 

This document discusses the attributes of AFS 3 and how it is used today in 
client/server environments. The document compares AFS 3 to the Network File 
System (NFS), explaining the advantages of AFS over an NFS environment. An 
additional topic is the use of AFS 3 as a stepping stone to the DCE, providing both 
a learning environment and a way to migrate existing resources. 

AFS 3 Attributes 

A single, shared name space for all users, from all machines. AFS brings together 
all of the files stored within the file system into a single name space. Every AFS 
user shares this same name space, making all AFS files easily available from any 
AFS machine. 

Location-independent file sharing. With AFS, the name of a file is independent of 
both the file's and the user's physical location, contributing to ease of file sharing 
and resource management. 

Client caching and efficient wide-area protocols for excellent performance. Both 
small and large-scale distributed environments benefit from AFS mechanisms to 
reduce server and network load. AFS caches data on client machines to reduce 
subsequent data requests directed at file servers, substantially reducing network 
and server loads. Servers keep track of client caches through callbacks, so a client 
does not need to constantly query the server to see if the file has changed. 
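
The callback idea in the paragraph above can be sketched in a few lines. This is an illustration only, assuming an in-memory server and whole-file caching; the names FileServer, Client, fetch, store and break_callback are hypothetical and are not the AFS interfaces.

```python
class FileServer:
    """Hypothetical server that remembers which clients cache each file."""

    def __init__(self):
        self.files = {}
        self.callbacks = {}   # path -> set of clients holding a callback promise
        self.fetch_count = 0  # counts how often clients actually contact the server

    def fetch(self, client, path):
        self.fetch_count += 1
        self.callbacks.setdefault(path, set()).add(client)
        return self.files[path]

    def store(self, writer, path, data):
        self.files[path] = data
        # Break the callback of every other client caching this file.
        for client in self.callbacks.pop(path, set()):
            if client is not writer:
                client.break_callback(path)
        # The writer now holds the only valid cached copy.
        self.callbacks[path] = {writer}


class Client:
    """Hypothetical client with a local cache kept valid by server callbacks."""

    def __init__(self, server):
        self.server = server
        self.cache = {}

    def read(self, path):
        # A cached copy is trusted until the server breaks the callback.
        if path not in self.cache:
            self.cache[path] = self.server.fetch(self, path)
        return self.cache[path]

    def write(self, path, data):
        self.server.store(self, path, data)
        self.cache[path] = data

    def break_callback(self, path):
        self.cache.pop(path, None)
```

Repeated reads of an unchanged file are served from the client cache without contacting the server; only a write by another client causes a new fetch.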

The AFS remote procedure call reads and writes data to a remote procedure call 
(RPC) stream, further improving the efficiency of data transfer across a local- or 
Distributed File System: A Logical View of Physical Storage 

White Paper 

Abstract 

The Microsoft® Distributed File System (Dfs) is a network server component that makes it easier for you to find 
and manage data on your network. Dfs is a means for uniting files on different computers into a single name 
space. Dfs makes it easy to build a single, hierarchical view of multiple file servers and file server shares on your 
network. 

Microsoft Distributed File System version 4.1 for Microsoft Windows NT® Server 4.0 is currently available for 
download from the Microsoft Web site at http://www.microsoft.com/ntserver. This release includes the Microsoft 
Windows® 95 operating system Dfs client and enhanced security signatures. In addition, Microsoft Windows® 2000 
will include directory service-enabled enhancements to Dfs. This paper covers Dfs technology as a whole, including 
the version for Windows 2000. 

Introduction 

The Distributed File System (Dfs) for the Microsoft® Windows NT® Server and Microsoft Windows® 2000 Server 
operating systems is a network server component that makes it easier for you to find and manage data on your 
network. Dfs is a means for uniting files on different computers into a single name space. Dfs makes it easy to 
build a single, hierarchical view of multiple file servers and file server shares on your network. Instead of seeing a 
physical network of dozens of file servers, each with a separate directory structure, users will now see a few logical 
directories that include all of the important file servers and file server shares. Each share appears in the most 
logical place in the directory, no matter what server it is actually on. 

Dfs does for servers and shares what file systems do for hard disks. File systems provide uniform named access to 
collections of sectors on disks; Dfs provides a uniform naming convention and mapping for collections of servers, 
shares, and files. Thus, Dfs makes it possible to organize file servers and their shares into a logical hierarchy, 
making it considerably easier for a large corporation to manage and use its information resources. In addition, Dfs 
is not limited to a single file protocol, and can support the mapping of servers, shares, and files, regardless of the 
file client being used, provided that the client supports the native server and share. 

What is a Distributed File System? 

Dfs provides name transparency to disparate server volumes and shares. Through Dfs, an administrator can build a 
single hierarchical file system whose contents are distributed throughout your organization's WAN. In short, Dfs 
can be thought of as a share of other shares. 

Historically, with the universal naming convention (UNC), a user or application was required to specify the physical 
server and share in order to access file information (that is, the user or application had to specify 
\\Server\Share\Path\Filename). Even though UNCs can be used directly, a UNC is typically mapped to a drive letter, 
where x: might be mapped to \\Server\Share. From that point, a user had to navigate beyond the redirected drive 
mapping to the data he or she wishes to access (for example, copy x:\Path\More_path\...\Filename). 

As networks continue to grow in size and as organizations begin to use existing storage, both internally and 
externally, for purposes such as intranets, mapping a single drive letter to individual shares scales poorly. Further, 
although users can use UNC names directly, these users can be overwhelmed by the number of places where data 
can be stored. 

Dfs solves these problems by permitting the linking of servers and shares into a simpler, more meaningful name 
space. This new Dfs volume permits shares to be hierarchically connected to other Windows shares. Since Dfs 
maps the physical storage into a logical representation, the net benefit is that the physical location of data 
becomes transparent to users and applications. 
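
The mapping from a logical name space to physical shares can be pictured as a longest-prefix lookup. The sketch below is an assumption-laden illustration, not Microsoft's Dfs algorithm or API: the table DFS_LINKS, the function resolve, and all server and share names are hypothetical.

```python
# Hypothetical link table: logical Dfs path -> physical server share.
DFS_LINKS = {
    r"\\Corp\Public": r"\\ServerA\Public",
    r"\\Corp\Public\Engineering": r"\\ServerB\EngShare",
    r"\\Corp\Public\Sales": r"\\ServerC\SalesData",
}


def resolve(logical_path, links=DFS_LINKS):
    """Map a logical path to a physical UNC path using the longest matching link."""
    lowered = logical_path.lower()
    best = None
    for prefix in links:
        p = prefix.lower()
        # Match whole path components only, case-insensitively.
        if lowered == p or lowered.startswith(p + "\\"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise LookupError("no Dfs link covers " + logical_path)
    return links[best] + logical_path[len(best):]
```

A user who opens \\Corp\Public\Engineering\specs\a.doc never needs to know that the file lives on ServerB; moving the share to another server only requires changing one entry in the table.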

Benefits of Dfs 
Abstract 

In this paper, we propose a new paradigm for network file 
system design, serverless network file systems. While tradi- 
tional network file systems rely on a central server machine, 
a serverless system utilizes workstations cooperating as 
peers to provide all file system services. Any machine in the 
system can store, cache, or control any block of data. Our 
approach uses this location independence, in combination 
with fast local area networks, to provide better performance 
and scalability than traditional file systems. Further, because 
any machine in the system can assume the responsibilities 
of a failed component, our serverless design also provides 
high availability via redundant data storage. To demonstrate 
our approach, we have implemented a prototype serverless 
network file system called xFS. Preliminary performance 
measurements suggest that our architecture achieves its goal 
of scalability. For instance, in a 32-node xFS system with 32 
active clients, each client receives nearly as much read or 
write throughput as it would see if it were the only active 
client. 

1. Introduction 

A serverless network file system distributes storage, 
cache, and control over cooperating workstations. This ap- 
proach contrasts with traditional file systems such as Net- 
ware [Majo94], NFS [Sand85], Andrew [Howa88], and 
Sprite [Nels88] where a central server machine provides all 
file system services. Such a central server is both a perfor- 
mance and reliability bottleneck. A serverless system, on the 
other hand, distributes control processing and data storage to 
achieve scalable high performance, migrates the responsibil- 
ities of failed components to the remaining machines to pro- 
vide high availability, and scales gracefully to simplify 
system management. 

Three factors motivate our work on serverless network 
file systems: the opportunity provided by fast switched 

This work is supported in part by the Advanced Research Projects Agency 
(N00600-93-C-2481, F30602-95-C-0014), the National Science Foundation (CDA 
0401156), California MICRO, the AT&T Foundation, Digital Equipment Corporation, 
Exabyte, Hewlett Packard, IBM, Siemens Corporation, Sun Microsystems, and Xerox 
Corporation. Anderson was also supported by a National Science Foundation Presidential 
Faculty Fellowship, Neefe by a National Science Foundation Graduate Research 
Fellowship, and Roselli by a Department of Education GAANN fellowship. The 
authors can be contacted at {tea, dahlin, neefe, patterson, drew, rvwang}@CS.Berke- 
Copyright © 1995 by the Association for Computing Machinery, Inc. Permission 
LANs, the expanding demands of users, and the fundamental 
limitations of central server systems. 

The recent introduction of switched local area networks 
such as ATM or Myrinet [Bode95] enables serverlessness by 
providing aggregate bandwidth that scales with the number 
of machines on the network. In contrast, shared media net- 
works such as Ethernet or FDDI allow only one client or 
server to transmit at a time. In addition, the move towards 
low latency network interfaces [vE92, Basu95] enables clos- 
er cooperation between machines than has been possible in 
the past. The result is that a LAN can be used as an I/O back- 
plane, harnessing physically distributed processors, memory, 
and disks into a single system. 

Next generation networks not only enable serverlessness, 
they require it by allowing applications to place increasing 
demands on the file system. The I/O demands of traditional 
applications have been increasing over time [Bake91]; new 
applications enabled by fast networks — such as multime- 
dia, process migration, and parallel processing — will fur- 
ther pressure file systems to provide increased performance. 
For instance, continuous media workloads will increase file 
system demands; even a few workstations simultaneously 
running video applications would swamp a traditional cen- 
tral server [Rash94]. Coordinated Networks of Workstations 
(NOWs) allow users to migrate jobs among many machines 
and also permit networked workstations to run parallel jobs 
[Doug91, Litz92, Ande95]. By increasing the peak process- 
ing power available to users, NOWs increase peak demands 
on the file system [Cyph93]. 

Unfortunately, current centralized file system designs 
fundamentally limit performance and availability since all 
read misses and all disk writes go through the central server. 
To address such performance limitations, users resort to 
costly schemes to try to scale these fundamentally unscal- 
able file systems. Some installations rely on specialized 
server machines configured with multiple processors, I/O 
channels, and I/O processors. Alas, such machines cost sig- 
nificantly more than desktop workstations for a given 
amount of computing or I/O capacity. Many installations 
also attempt to achieve scalability by distributing a file system 
among multiple servers by partitioning the directory 
tree. This approach only moderately increases scalability 
because its coarse distribution often results in hot spots when the 
partitioning allocates heavily used files and directory 
trees to a single server [Wolf89]. It is also expensive, since 
it requires the (human) system manager to effectively be- 
come part of the file system — moving users, volumes, and 
disks among servers to balance load. Finally, AFS [Howa88] 
attempts to improve scalability by caching data on client 
ABSTRACT 

The goal of CMU's Network-Attached Secure Disks (NASD) 
project is to define the next era of storage system interfaces 
and architectures. To encourage industry standardization of 
a compliant storage device/subsystem interface, we are 
working closely with the National Storage Industry Consor- 
tium's working group on network-attached storage. Our 
experimental demonstration of the NASD interface's value is 
device and filesystem prototype software that delivers the 
scalability inherent in a NASD storage architecture. To 
engage the academic community and to provide a reference 
implementation for industry development, CMU is releasing 
its Linux and Digital UNIX ports of this software. In this 
paper, we overview the NASD scalable storage architecture 
and the code-base we are releasing for Linux. 



1. INTRODUCTION 

Demands for storage throughput continue to grow due to 
ever larger clusters sharing storage, rapidly increasing client 
performance, richer data types such as video, and data-inten- 
sive applications such as data mining. For storage 
subsystems to deliver scalable throughput, that is, linearly 
increasing application bandwidth and accesses per second 
with increasing numbers of storage devices and client 
processors, the data must be striped over many disks and 
network links [Patterson88], and name lookup and access 
rights checking must be decentralized [Hartman93, 
Anderson96]. With current technology, most office, engi- 
neering, and data processing shops have sufficient numbers 
of disks and scalable switched networking, but they access 
storage through storage controller and distributed fileserver 
bottlenecks. These bottlenecks arise because a single 
"server" computer copies data between the storage (periph- 
eral) network and the client (local area) network while 
adding functions such as concurrency control and metadata 
consistency. 

Our prior work proposed a new scalable-bandwidth storage 
architecture, Network-Attached Secure Disks (NASD) 
[Gibson97a, Gibson97b, ...]. 
Fundamentally, NASD minimizes server-based 
data movement by separating management and filesystem 
semantics from store-and-forward copying and elevating 
commodity storage's interface to a richer object-based 
model (SCSI4 perhaps). 



As with earlier generations of SCSI, the NASD interface is 
simple, efficient and flexible enough to support a wide range 
of filesystem semantics across multiple generations of tech- 
nology. Of course, advancing storage interfaces and archi- 
tecture requires industry collaboration and standardization. 
Fortunately, the storage industry is aggressively seeking to 
evolve their marketplace [Quantum99, Seagate99]. To 
promote network-attached storage, CMU is working closely 
with the National Storage Industry Consortium's (NSIC) 
working group on network-attached storage devices 
(www.nsic.org/nasd). Over the past three years, NSIC has 
hosted about a dozen public workshops where academics 
and practitioners exchange perspectives on next generation 
storage. Currently, the core NSIC working group is engaged 
in developing an ANSI standards proposal for a new storage 
interface. 

Until recently, CMU publications have been sufficient for 
collaboration in the NSIC effort. Now, to more widely 
disseminate our work, CMU is providing, for public use, a 
reference implementation of NASD for the Linux 2.2 and 
Digital UNIX 3.2 environments. Our reference implementa- 
tion includes NASD device code (running on a workstation 
or PC masquerading as a subsystem or disk drive), an NFS- 
like distributed file system designed to use NASD 
subsystems or devices, and NASD-inspired striping middle- 
ware to provide scalable bandwidth to large striped files. The 
rest of this extended abstract describes this prototype soft- 
ware and summarizes prior research predictions for its 
performance. 

2. BACKGROUND AND RELATED WORK 

Figure 1 illustrates the principal network-attached storage 
architectures. The simplest implementation runs on a standalone 
server with attached disks (SAD), as shown in 
Figure 1a. Data makes two network trips on its way to the 
client, making the server a potential bottleneck, particularly 
since a server usually manages a large number of disks to 
amortize cost. Companies such as Network Appliance have 
improved the performance of SAD implementations, specifically 
the number of clients supported, by using special-purpose 
server hardware and highly optimized software (SID) 
[Hitz94]. 
Abstract 

By providing direct data transfer between storage and client, net- 
work-attached storage devices have the potential to improve scal- 
ability for existing distributed file systems (by removing the server 
as a bottleneck) and bandwidth for new parallel and distributed file 
systems (through network striping and more efficient data paths). 
Together, these advantages influence a large enough fraction of the 
storage market to make commodity network-attached storage fea- 
sible. Realizing the technology's full potential requires careful 
consideration across a wide range of file system, networking and 
security issues. This paper contrasts two network-attached storage 
architectures— (1) Networked SCSI disks (NetSCSI) are network- 
attached storage devices with minimal changes from the familiar 
SCSI interface, while (2) Network-Attached Secure Disks (NASD) 
are drives that support independent client access to drive object 
services. To estimate the potential performance benefits of these 
architectures, we develop an analytic model and perform trace- 
driven replay experiments based on AFS and NFS traces. Our 
results suggest that NetSCSI can reduce file server load during a 
burst of NFS or AFS activity by about 30%. With the NASD archi- 
tecture, server load (during burst activity) can be reduced by a fac- 
tor of up to five for AFS and up to ten for NFS. 

1 Introduction 

Users are increasingly using distributed file systems to access 
data across local area networks; personal computers with hundred- 
plus MIPS processors are becoming increasingly affordable; and 
the sustained bandwidth of magnetic disk storage is expected to 
exceed 30 MB/s by the end of the decade. These trends place a 
pressing need on distributed file system architectures to provide 
clients with efficient, scalable, high-bandwidth access to stored 
data. This paper discusses a powerful approach to fulfilling this 
need. Network-attached storage provides high bandwidth by 
directly attaching storage to the network, avoiding file server 
store-and-forward operations and allowing data transfers to be 
striped over storage and switched-network links. 

The principal contribution of this paper is to demonstrate the 
potential of network-attached storage devices for penetrating the 
markets defined by existing distributed file system clients, specifi- 
cally the Network File System (NFS) and Andrew File System 
(AFS) distributed file system protocols. Our results suggest that 
network-attached storage devices can improve overall distributed 
file system cost-effectiveness by offloading disk access, storage 
management and network transfer and greatly reducing the amount 
of server work per byte accessed. 

We begin by charting the range of network-attached storage 
devices that enable scalable, high-bandwidth storage systems. Spe- 
cifically, we present a taxonomy of network-attached storage — 
server-attached disks (SAD), networked SCSI (NetSCSI) and net- 
work-attached secure disks (NASD) — and discuss the distributed 
file system functions offloaded to storage and the security models 
supportable by each. 

With this taxonomy in place, we examine traces of requests 
on NFS and AFS file servers, measure the operation costs of com- 
monly used SAD implementations of these file servers and 
develop a simple model of the change in manager costs for NFS 
and AFS in NetSCSI and NASD environments. Evaluating the 
impact on file server load analytically and in trace-driven replay 
experiments, we find that NASD promises much more efficient 
file server offloading in comparison to the simpler NetSCSI. With 
this potential benefit for existing distributed file server markets, 
we conclude that it is worthwhile to engage in detailed NASD 
implementation studies to demonstrate the efficiency, throughput 
and response time of distributed file systems using network- 
attached storage devices. 

In Section 2, we discuss related work. Section 3 presents our 
taxonomy of network-attached storage architectures. In Section 4, 
we describe the NFS and AFS traces used in our analysis and 
replay experiments and report our measurements of the cost of 
each server operation in CPU cycles. Section 5 develops an analytic 
model to estimate the potential scaling offered by server offloading 
in NetSCSI and NASD based on the collected traces and 
the measured costs of server operations. The trace-driven replay 
experiment and the results are the subject of Section 6. Finally, 
Section 7 presents our conclusions and discusses future directions. 
Abstract 

Managing large objects with high data-rate require- 
ments is difficult for current computing systems. We 
describe an Input/Output architecture, called Swift, 
that addresses the problem of storing and retrieving 
very large data objects from slow secondary storage 
at very high data-rates. Applications that require 
this capability are poorly supported in current sys- 
tems, even though they are made possible by high- 
speed networks. These range from storage and visu- 
alization of scientific computations to recording and 
play-back of color video in real-time. Swift addresses 
the problem of providing the data rates required by 
digital video by exploiting the available interconnec- 
tion capacity and by using several slower storage de- 
vices in parallel. 

We have done two studies to validate the Swift ar- 
chitecture: a simulation study and an Ethernet-based 
proof-of-concept implementation. Both studies indi- 
cate that the aggregation principle proposed in Swift 
can yield very high data-rates. We present a brief 
summary of these studies. 



1 Introduction 

The disparity between processing speed, network 
transfer rates, and the performance of disk storage 
systems will increase in the future. The processing 
speed of computing systems continues to increase at 
an exponential rate. Advances in communications 
technology are providing increased transfer rates even 
more rapidly than the increases in processing speed. 

In contrast to these advances, disk storage technol- 
ogy remains much the same. Although the density of 
the media has greatly increased, there has been little 
improvement in either access times or data transfer 
rates. In the case of optical storage, the access times 
have increased and the data transfer rates have de- 
creased relative to magnetic media. Due to physical 
considerations, substantial increases in disk storage 
data transfer rates seem unlikely. 

Because of increased processing power and the po- 
tential for high network transfer rates, new applica- 
tions are emerging. These applications range from 
bulk data transfer for super computers to managing 
digital color video in real-time. Today, managing dig- 
itized color video in real-time is impossible. Storing 
just a few minutes of digitized color video requires 
gigabytes of storage. Storing or retrieving it in real- 
time requires sustained transfer rates on the order of 
20 megabytes per second. 

Our architecture, called Swift, addresses the prob- 
lem of storing and retrieving large data objects from 
slow secondary storage at very high data-rates. Swift 
is based on the premises that: (1) the network inter- 
connection will be capable of supporting much higher 
data-rates than individual storage agents; (2) re- 
sources can be preallocated for storing and transmit- 
ting data; (3) multiple storage agents can be driven 
concurrently using data striping; and (4) failures of 
storage agents can be masked using data redundancy. 

Swift is based on a client-server model and ad- 
dresses the issues of authentication, access control, 
and encryption. Since it is a distributed architec- 
ture made up of independently replaceable compo- 
nents, it can provide very high reliability. It is adapt- 
able to different network interconnection topologies 
and technologies. Swift operates by having a stor- 
age mediator reserve resources from storage agents 
in a session-oriented manner, and then presenting a 
distribution agent with a transfer plan. The distribu- 
tion agent stores or retrieves the data at the storage 
agents following that plan. 

Even though Swift was designed with very large 
objects in mind, it can handle small objects such as 
those encountered in normal file systems with two 
penalties: one round trip time for a short network 
message to consult the storage mediator, and com- 
puting the required data redundancy. Swift is also 
well suited as a swapping device for high performance 
We use the technique of storing the data of a single object across several storage servers, 
called data striping, to achieve high transfer data rates in a local area network. Using 
parallel paths to data allows a client to transfer data to and from storage at a higher 
rate than that supported by a single storage server. We have implemented a network data 
service, called Swift, that uses data striping. Swift exhibits the expected scaling property in 
the number of storage servers connected to a network and in the number of interconnection 
networks present in the system. 
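
Data striping as described above can be sketched as a round-robin distribution of fixed-size units. This is a minimal illustration, assuming in-memory lists in place of storage servers; the function names stripe and reassemble and the unit size are hypothetical and do not come from the Swift implementation.

```python
def stripe(data, n_servers, unit):
    """Split data into fixed-size units and deal them round-robin to servers."""
    servers = [[] for _ in range(n_servers)]
    for offset in range(0, len(data), unit):
        index = (offset // unit) % n_servers
        servers[index].append(data[offset:offset + unit])
    return servers


def reassemble(servers):
    """Rebuild the original object by reading one unit from each server in turn."""
    pieces = []
    depth = max(len(units) for units in servers)
    for row in range(depth):
        for units in servers:
            if row < len(units):
                pieces.append(units[row])
    return b"".join(pieces)
```

Because consecutive units sit on different servers, a client can request them in parallel, which is where the higher aggregate transfer rate comes from.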

We have also simulated a version of Swift to explore the limits of possible future configu- 
rations. We observe that the system can evolve to support very high speed interconnection 
networks as well as large numbers of storage servers. Since Swift is a distributed system 
made up of independently replaceable components, any component that limits the perfor- 
mance can either be replaced by a faster component when it becomes available or can be 
replicated and used in parallel. This should allow the system to incorporate and exploit 
emerging storage and networking technologies. 

1 Introduction 

The current generation of distributed computing systems do not support I/O-intensive appli- 
cations well. In particular, they are incapable of integrating high-quality video with other 
data in a general purpose environment. For example, multimedia applications that require 
this level of service include scientific visualization, image processing, and recording and play- 
back of color video. The data rates required by some of these applications range from 1.2 
Abstract

The Swift I/O architecture is designed to provide high data rates in support of multimedia type 
applications in general purpose distributed environments through the use of distributed striping. Striping 
techniques place sections of a single logical data space onto multiple physical devices. The original 
Swift prototype was designed to validate the architecture, but did not provide fault tolerance. We have 
implemented a new prototype of the Swift architecture that provides fault tolerance in the distributed
environment in the same manner as RAID levels 4 and 5. RAID (Redundant Arrays of Inexpensive 
Disks) techniques have recently been widely used to increase both performance and fault tolerance of 
disk storage systems. 
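
As a rough illustration of how RAID level 4 and 5 style parity gives fault tolerance across nodes, the sketch below computes an XOR parity block for a stripe and rebuilds a single lost block. It is not the Swift/RAID implementation: the helper names, the choice of the last node as the level 4 parity node, and the particular level 5 rotation are assumptions.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_node(stripe_index: int, num_nodes: int, level: int) -> int:
    """Node that holds the parity block of a given stripe.

    Level 4: one dedicated parity node (assumed here to be the last node).
    Level 5: parity rotates over all nodes (one possible rotation, assumed).
    """
    if level == 4:
        return num_nodes - 1
    return (num_nodes - 1 - stripe_index) % num_nodes


def reconstruct(stripe: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild the single missing block (data or parity) of a stripe."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    assert len(missing) == 1, "single-parity schemes tolerate one failure"
    survivors = [block for block in stripe if block is not None]
    rebuilt = list(stripe)
    # XOR of all surviving blocks equals the lost block.
    rebuilt[missing[0]] = xor_blocks(survivors)
    return rebuilt
```

Under these assumptions, a five node stripe carries four data blocks and one parity block, and the loss of any one node can be repaired from the other four.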

The new Swift/RAID implementation manages all communication using a distributed transfer plan 
executor which isolates all communication code from the rest of Swift. The transfer plan executor is 
implemented as a distributed finite state machine which decodes and executes a set of reliable data transfer 
operations. This approach enabled us to easily investigate alternative architectures and communications 
protocols. 

Providing fault tolerance comes at a cost, since computing and administering parity data impacts 
Swift/RAID data rates. For a five node system, in one typical performance benchmark, Swift/RAID level 
5 obtained 87% of the original Swift read throughput and 53% of the write throughput. Swift/RAID level 
4 obtained 92% of the original Swift read throughput and 34% of the write throughput. 

Keywords: Swift architecture, RAID, data striping, client-server data transmission, network data 
service, distributed atomic operations, concurrent programming, distributed state machines, real-time dis- 
tributed programming. 

1 Introduction 

The Swift system was designed to investigate the use of network disk striping to achieve the data rates required 
by multimedia in a general purpose distributed system. The original Swift prototype was implemented during 
1991, and its design and performance was described, investigated, and reported [Cabrera and Long, 1991, 
Emigh, 1992]. A high-level view of the Swift architecture is shown in Figure 1. Swift uses a high speed 
interconnection medium to aggregate arbitrarily many (slow) storage devices into a faster logical storage
service, making all applications unaware of this aggregation. Swift uses a modular client-server architecture 
made up of independently replaceable components. 

Disk striping is a technique analogous to main memory interleaving that has been used for some time to 
enhance throughput and balance disk load in disk arrays [Kim, 1986, Salem and Garcia-Molina, 1986]. In
such systems writes scatter data across devices (the members of the stripe) while reads 'gather' data from 

†Supported in part by the National Science Foundation under Grant NSF CCR-9111220 and by the Office of Naval Research
under Grant N00014-92-J-1807
Abstract



As Linux clusters have matured as platforms for low- 
cost, high-performance parallel computing, software 
packages to provide many key services have emerged, 
especially in areas such as message passing and net- 
working. One area devoid of support, however, has 
been parallel file systems, which are critical for high- 
performance I/O on such clusters. We have developed a 
parallel file system for Linux clusters, called the Parallel 
Virtual File System (PVFS). PVFS is intended both as 
a high-performance parallel file system that anyone can 
download and use and as a tool for pursuing further re- 
search in parallel I/O and parallel file systems for Linux 
clusters. 

In this paper, we describe the design and implementa- 
tion of PVFS and present performance results on the 
Chiba City cluster at Argonne. We provide performance 
results for a workload of concurrent reads and writes 
for various numbers of compute nodes, I/O nodes, and
I/O request sizes. We also present performance results 
for MPI-IO on PVFS, both for a concurrent read/write 
workload and for the BTIO benchmark. We compare the 
I/O performance when using a Myrinet network versus a 
fast-ethernet network for I/O-related communication in 
PVFS. We obtained read and write bandwidths as high as 
700 Mbytes/sec with Myrinet and 225 Mbytes/sec with 
fast ethernet.
1 Introduction



Cluster computing has recently emerged as a mainstream
method for parallel computing in many application
domains, with Linux leading the pack as the most
popular operating system for clusters. As researchers 
continue to push the limits of the capabilities of clus- 
ters, new hardware and software have been developed to 
meet cluster computing's needs. In particular, hardware 
and software for message passing have matured a great 
deal since the early days of Linux cluster computing; in- 
deed, in many cases, cluster networks rival the networks 
of commercial parallel machines. These advances have 
broadened the range of problems that can be effectively 
solved on clusters. 

One area in which commercial parallel machines have 
always maintained great advantage, however, is that 
of parallel file systems. A production-quality high- 
performance parallel file system has not been available 
for Linux clusters, and without such a file system, Linux 
clusters cannot be used for large I/O-intensive parallel 
applications. We have developed a parallel file system 
for Linux clusters, called the Parallel Virtual File System 
(PVFS) [33], that can potentially fill this void. PVFS is 
being used at a number of sites, such as Argonne Na- 
tional Laboratory, NASA Goddard Space Flight Center, 
and Oak Ridge National Laboratory. Other researchers 
are also using PVFS in their studies [28].

We had two main objectives in developing PVFS. First,
we needed a basic software platform for pursuing further
research in parallel I/O and parallel file systems in the 
context of Linux clusters. For this purpose, we needed 
a stable, full-featured parallel file system to begin with. 
Our second objective was to meet the need for a paral- 
"Disk striping" refers to the process of storing a single file across multiple disk partitions. Each
partition contains a different part of the file. Disk striping can greatly increase I/O since each 
partition can be accessed in parallel. Striping significantly improves I/O time for file sizes of 10s
of MB or greater. 



The /usr/tmp partition on mcurie now has system-level striping performed automatically 
across four disks. This removes much of the advantage of user-level striping as discussed 
below. The remaining text on this page was written before the installation of the current 
disks and is left here for reference.



You use the assign command to set parameters that control how the striping is performed.
Fortran OPEN(), READ(), and WRITE() statements are used as usual without modification. For
illustration, we'll consider one PE writing a single file across multiple disk partitions on the 
NERSC Cray T3E-900. 
Abstract

The ideal distributed file system would provide all its users with coherent,
shared access to the same set of files, yet would be arbitrarily
scalable to provide more storage space and higher performance to 
a growing user community. It would be highly available in spite of 
component failures. It would require minimal human administration,
and administration would not become more complex as more
components were added. 

Frangipani is a new file system that approximates this ideal, yet 
was relatively easy to build because of its two-layer structure. The 
lower layer is Petal (described in an earlier paper), a distributed 
storage service that provides incrementally scalable, highly avail- 
able, automatically managed virtual disks. In the upper layer, 
multiple machines run the same Frangipani file system code on top 
of a shared Petal virtual disk, using a distributed lock service to 
ensure coherence. 

Frangipani is meant to run in a cluster of machines that are under 
a common administration and can communicate securely. Thus the
machines trust one another and the shared virtual disk approach is 
practical. Of course, a Frangipani file system can be exported to 
untrusted machines using ordinary network file access protocols. 

We have implemented Frangipani on a collection of Alphas 
running DIGITAL Unix 4.0. Initial measurements indicate that 
Frangipani has excellent single-server performance and scales well 
as servers are added. 



1 Introduction 

File system administration for a large, growing computer installation
built with today's technology is a laborious task. To hold more
files and serve more users, one must add more disks, attached to 
more machines. Each of these components requires human admin- 
istration. Groups of files are often manually assigned to particular 
disks, then manually moved or replicated when components fill 
up, fail, or become performance hot spots. Joining multiple disk
drives into one unit using RAID technology is only a partial solution;
administration problems still arise once the system grows
large enough to require multiple RAIDs and multiple server machines.

Frangipani is a new scalable distributed file system that manages 
a collection of disks on multiple machines as a single shared pool 
of storage. The machines are assumed to be under a common 
administration and to be able to communicate securely. There have
been many earlier attempts at building distributed file systems that
scale well in throughput and capacity [1, 11, 19, 20, 21, 22, 26,
31, 33, 34]. One distinguishing feature of Frangipani is that it has 
a very simple internal structure — a set of cooperating machines 
use a common store and synchronize access to that store with 
locks. This simple structure enables us to handle system recovery, 
reconfiguration, and load balancing with very little machinery. 
Another key aspect of Frangipani is that it combines a set of features 
that makes it easier to use and administer Frangipani than existing 
file systems we know of. 

1. All users are given a consistent view of the same set of files.

2. More servers can easily be added to an existing Frangipani 
installation to increase its storage capacity and throughput, 
without changing the configuration of existing servers, or 
interrupting their operation. The servers can be viewed as 
"bricks" that can be stacked incrementally to build as large a 
file system as needed. 

3. A system administrator can add new users without concern 
for which machines will manage their data or which disks 
will store it.

4. A system administrator can make a full and consistent backup
of the entire file system without bringing it down. Backups
can optionally be kept online, allowing users quick access to
accidentally deleted files.

5. The file system tolerates and recovers from machine, network, 
and disk failures without operator intervention.

Frangipani is layered on top of Petal [24], an easy-to-administer
distributed storage system that provides virtual disks to its clients.
Like a physical disk, a Petal virtual disk provides storage that can
be read or written in blocks. Unlike a physical disk, a virtual
disk provides a sparse 2^64 byte address space, with physical storage
allocated only on demand. Petal optionally replicates data for
high availability. Petal also provides efficient snapshots [7, 10] to
support consistent backup. Frangipani inherits much of its scalability,
fault tolerance, and easy administration from the underlying
storage system, but careful design was required to extend these
Adopting a new
hierarchical check- 
pointing architecture, 
the authors develop a 
single I/O address space 
for building highly 
available clusters of 
computers. They 
propose a systematic 
approach to achieving 
single system image by 
integrating existing 
middleware support 
with the newly 
developed features. 



Designing SSI Clusters 
with Hierarchical 
Checkpointing and 
Single I/O Space 



The computing trend is moving from clustering high-end mainframes
to clustering desktop computers. This trend is triggered
by the widespread use of PCs, workstations, gigabit networks,
and middleware support for clustering [1]. This article presents
new approaches to achieving fault tolerance and single system image
(SSI) in a workstation cluster. In a cluster
of computers, local area networks or high-bandwidth
switch networks using optical fibers physically connect a collection of
node computers. The workstations in a cluster can work collectively as an
integrated computing resource (that is, an SSI), or they can operate as individual
computers.

Present clusters are usually small and provide only
limited SSI services. Future clusters will likely increase
in scalability and offer more SSI support, as Figure 1
illustrates. The implication is that future clusters could
replace the MPP, SMP, or CC-NUMA architectures
(see "The cluster as a computer architecture" sidebar
for key characteristics of these computer platforms).

We focus on clusters with high availability through SSI support,
distributed RAID (redundant arrays of
inexpensive disks) with parity checks, and
hierarchical checkpointing with adaptive
recovery. In particular, we developed a single
I/O address space among all disks and
peripheral devices attached in the cluster.
This enables direct remote disk access,
which is a necessary step to implement a



Figure 1. Design space of competing computer
Introduction

Delivering a great quality application with a rich feature set isn't enough in all cases; increasingly, it must also meet high availability
criteria. Have you avoided taking your application to the next level because cluster technology seems too daunting to understand and
use? With Microsoft® Cluster Service, introduced in Windows® NT™ 4 and available in the Windows Server 2003 family,
developers have at their disposal straightforward tools to deploy applications in a clustered environment. These include the ability to
enlist any application in a cluster as a generic application, and the ability to control application configuration by means of Windows
scripting.

A cluster connects two or more servers together so that they appear as a single computer to clients. Connecting servers in a cluster
allows for workload sharing, enables a single point of operation/management, and provides a path for scaling to meet increased 
demand. Thus, clustering gives you the ability to produce high availability applications. 

This paper focuses on Cluster Service, one of three Microsoft server technologies that support clustering. We demonstrate how to 
easily perform a sanity check of your application within a cluster environment without having to make any changes to your 
application's code. 

Three Technologies for Clustering 

Microsoft servers provide three technologies to support clustering: Network Load Balancing (NLB), Component Load Balancing (CLB), 
and Microsoft Cluster Service (MSCS).

Network Load Balancing 

Network Load Balancing acts as a front-end cluster, distributing incoming IP traffic across a cluster of servers, and is ideal for
enabling incremental scalability and outstanding availability for e-commerce Web sites. Up to 32 computers running a member of the
Windows Server 2003 family can be connected to share a single virtual IP address. NLB enhances scalability by distributing its client
requests across multiple servers within the cluster. As traffic increases, additional servers can be added to the cluster; up to 32
servers are possible in any one cluster. NLB also provides high availability by automatically detecting the failure of a server and
repartitioning client traffic among the remaining servers within 10 seconds, while it provides users with continuous service.
Edgar H. Sibley, Panel Editor

Using only a few simple and commonplace instructions, this algorithm
efficiently maps variable-length text strings onto small integers.



Fast Hashing of Variable- 
Length Text Strings 

Peter K. Pearson 



In the literature on hashing techniques, most authors 
spend little time discussing any particular hashing 
function, but make do with an allusion to Knuth [3] in 
their haste to get to the interesting topics of table organization
and collision resolution. The relatively rare
articles on hashing functions themselves [2] tend to 
discuss algorithms that operate on values of predeter- 
mined length or that make heavy use of operations 
(multiplication, division, or shifts of long bit strings) 
that are absent from the instruction sets of smaller 
microprocessors. 

This article proposes a hashing function specifically 
tailored to variable-length text strings. This function 
takes as input a word W consisting of some number n of
characters, C[1], C[2], ..., C[n], each character being represented
by one byte, and returns an index in the range
0-255. An auxiliary table T of 256 randomish bytes is
used in the process. Here is the proposed algorithm:¹

h[0] := 0 ;
for i in 1 .. n loop
    h[i] := T[ h[i-1] xor C[i] ] ;
end loop ;
return h[n] ;

Notice that the processing of each additional charac- 
ter of text requires only an exclusive-OR operation and 
an indexed memory read. Also note that it is not necessary
to know the length of the string at the beginning of
the computation, a property useful when the end of the 
text string is indicated by a special character rather 
than by a separately stored length variable. 
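
For readers who prefer running code, the loop above can be written as the following sketch. The table built by `make_table` is an arbitrary seeded shuffle of 0..255 chosen for illustration; it is an assumption and is not the permutation published with the article. The function names are likewise hypothetical.

```python
import random
from typing import List


def make_table(seed: int = 0) -> List[int]:
    """Build an auxiliary table T as a shuffle of 0..255.

    The seed and the shuffle are illustrative assumptions, not the
    article's published table.
    """
    table = list(range(256))
    random.Random(seed).shuffle(table)
    return table


def pearson_hash(text: bytes, table: List[int]) -> int:
    """Compute h[i] = T[h[i-1] xor C[i]], starting from h[0] = 0."""
    h = 0
    for c in text:  # one XOR and one indexed read per character
        h = table[h ^ c]
    return h
```

The subscripts on h disappear in the code: a single running value is enough, and the loop consumes characters one at a time without needing the string length in advance.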

Two desirable properties of this algorithm for hashing 
variable-length strings derive from the technique of 
cryptographic checksums or message authentication codes 
[4], from which it is adapted. First, a good cryptographic
checksum ensures that small changes to the
data result in large and seemingly random changes to 
the checksum. In the hashing adaptation, this results in 
good separation of very similar strings. Second, on a 
good cryptographic checksum the effect of changing 
one part of the data must not be cancelled by an easily 

¹ In a practical implementation, the subscripts on h are omitted. They are
shown here to clarify the discussion.
computed change to some other part. In hashing, this
ensures good separation of anagrams, the downfall of 
hashing strategies that begin with a length-reducing ex- 
clusive-OR of substrings. 

The auxiliary table T is obviously crucial to this algo- 
rithm, yet I have found very few constraints on its 
construction. Since the hashing function can only re- 
turn values that appear in T, each index from 0 to 255 
must appear in T exactly once. In other words, T must
be a permutation of the values (0 ... 255). Obviously, if
T[i] = i, the corresponding h is merely a longitudinal
exclusive-OR checksum, which is a bad hashing function
because it does not separate anagrams. I have experimented
by filling T with randomly generated permutations
of (0 ... 255) and have found no outstanding
good or bad arrangements. (An attempt to promote 
greater dispersal among very similar short strings by 
clever choice of T, however, turned out to be a very 
bad idea.) 
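
The remark about the identity table can be checked directly. The sketch below, with hypothetical names and an arbitrary seeded shuffle standing in for a "randomly generated permutation", shows that with T[i] = i the hash reduces to an XOR of all the bytes, so any two anagrams collide.

```python
import random
from functools import reduce


def pearson_hash(text: bytes, table: list) -> int:
    """h[i] = T[h[i-1] xor C[i]] with h[0] = 0."""
    h = 0
    for c in text:
        h = table[h ^ c]
    return h


def xor_checksum(text: bytes) -> int:
    """Longitudinal exclusive-OR checksum of all bytes."""
    return reduce(lambda a, b: a ^ b, text, 0)


# Identity table: T[i] = i.
identity = list(range(256))

# A shuffled table (seed chosen arbitrarily for illustration).
shuffled = list(range(256))
random.Random(1).shuffle(shuffled)

# With the identity table the hash is just the XOR checksum, which is
# order-independent, so "listen" and "silent" necessarily collide.
same_under_identity = (
    pearson_hash(b"listen", identity) == pearson_hash(b"silent", identity)
)
```

With a shuffled table the result depends on character order, so anagrams collide only by chance rather than by construction.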

For the interested reader who does not want to generate
his own random permutations, Table I presents
the permutation used in the tests described later in this
article. 

SEPARATION PERFORMANCE 
The purpose of any text hashing function is to take text
strings — even very similar text strings — and map them 
onto integers that are spread as uniformly as possible
over the intended range of output values. In the absence
of prior knowledge about the strings being
hashed, a perfectly uniform output distribution cannot 
be expected. The best result that one can expect to 
achieve consistently is a seemingly random mapping of 
input strings onto output values. To see how well h 
does its job, one might ask the following questions.

• If h is applied to a string of random bytes, is each of
the 256 possible outcomes equally likely? The answer,
probably not surprisingly, is yes. From the algorithm
given earlier, it is clear that if the last input
character, C[n], is random (equally likely to take any
value, and uncorrelated with any preceding character),
then all final values of h are equally likely.

• If two input strings differ by a single bit, will their
hash function values collide more often than by 
Configuring redundant disk arrays is a black art. To configure an array properly, a system
administrator must understand the details of both the array and the workload it will support. Incorrect 
understanding of either, or changes in the workload over time, can lead to poor performance. 
We present a solution to this problem: a two-level storage hierarchy implemented inside a single disk- 
array controller. In the upper level of this hierarchy, two copies of active data are stored to provide 
full redundancy and excellent performance. In the lower level, RAID 5 parity protection is used to
provide excellent storage cost for inactive data, at somewhat lower performance. 
The technology we describe in this paper, known as HP AutoRAID, automatically and transparently
manages migration of data blocks between these two levels as access patterns change. The result is a 
fully redundant storage system that is extremely easy to use, is suitable for a wide variety of 
workloads, is largely insensitive to dynamic workload changes, and performs much better than disk 
arrays with comparable numbers of spindles and much larger amounts of front-end RAM cache. 
Because the implementation of the HP AutoRAID technology is almost entirely in software, the
additional hardware cost for these benefits is very small. 

We describe the HP AutoRAID technology in detail, provide performance data for an embodiment of
it in a storage array, and summarize the results of simulation studies used to choose algorithms 
implemented in the array. 

Categories and Subject Descriptors: B.4.2 [Input/Output and Data Communications]: 
Input/Output devices — channels and controllers; B.4.5 [Input/Output and Data 
Communications]: Reliability, Testing, and Fault-Tolerance — redundant design; D.4.2 [Operating
Systems]: Storage Management — secondary storage 
General Terms: Algorithms, Design, Performance, Reliability 
Additional Key Words and Phrases: Disk array, RAID, storage hierarchy 



1. INTRODUCTION

Modern businesses and an increasing number of individuals depend on the
information stored in the computer systems they use. Even though modern 
disk drives have mean-time-to-failure (mttf) values measured in hundreds of 



Author's addresses: Hewlett-Packard Laboratories, mailstop 1U13, 1501 Page Mill Road, Palo Alto,
CA 94304-1126; email: {wilkes,golding,staelin,sullivan}@hpl.hp.com.

Permission to make digital/hard copy of all or part of this material without fee is granted provided
that the copies are not made or distributed for profit or commercial advantage; the ACM
copyright/server notice, the title of the publication, and its date appear; and notice is given that copying is
by permission of the Association for Computing Machinery, Inc. (ACM). To copy otherwise, to
republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
© 1996 ACM 0734-2071/96/0200-0108 $03.50 



AFRAID — A Frequently Redundant Array of Independent Disks 

Stefan Savage 
University of Washington, Seattle, WA 

John Wilkes 
Hewlett-Packard Laboratories, Palo Alto, CA 



Abstract 

Disk arrays are commonly designed to ensure that 
stored data will always be able to withstand a disk 
failure, but meeting this goal comes at a significant
cost in performance. We show that this is unnecessary. 
By trading away a fraction of the enormous reliability 
provided by disk arrays, it is possible to achieve 
performance that is almost as good as a non-parity- 
protected set of disks. 

In particular, our AFRAID design eliminates the small- 
update penalty that plagues traditional RAID 5 disk 
arrays. It does this by applying the data update 
immediately, but delaying the parity update to the next 
quiet period between bursts of client activity. That is, 
AFRAID makes sure that the array is frequently
redundant, even if it isn't always so. By regulating the
parity update policy, AFRAID allows a smooth trade-off
between performance and availability.
Under real-life workloads, the AFRAID design can
provide close to the full performance of an array of 
unprotected disks, and data availability comparable to 
a traditional RAID 5. Our results show that AFRAID 
offers 42% better performance for only 10% less 
availability, 97% better for 23% less, and as much as a 
factor of 4.1 times better performance for giving up 
less than half RAID 5's availability. 
We explore here the detailed availability and 
performance implications of the AFRAID approach.

1. Introduction 

In a RAID 5 disk array, small writes take a long time to 
complete [Patterson88]. This is known as the "small
update problem". In such an array, redundancy for a 
stripe of data is provided by a parity block, computed 
Figure 1: Doing a small update in a traditional RAID 5.



as the XOR of the data blocks in the stripe, in order to
allow recovery if any disk fails. If a portion of a stripe
is updated, the parity data must also be updated to 
preserve the recoverability property (Figure 1). To do 
this, it is necessary to (1) read the old value of the data 
to be overwritten, unless it is already cached in the 
array controller; (2) read the old parity; (3) XOR the 
new data with the old, and XOR the result with the old 
parity to generate the new parity data; (4) write the 
new data and (5) write the new parity. 
Thus, three or four disk I/Os are needed to achieve one 
small write — all of which are in the critical path. In 
contemplating this problem we made the following 
observations: 

• modern disks are extremely reliable — so much so 
that disk array reliability is limited more by its 
support components than its disks; 

• many real workloads have slack periods between 
bursts of client activity; 

• people are already well-used to the notion of time- 
limited exposure to risk. 

These eventually led us to the idea of AFRAID (A 
Frequently Redundant Array of Independent Disks). 1 
AFRAID is a RAID 5 disk array that relaxes the
coherency between data and parity for short periods of 
time; parity is made consistent again in the idle periods 
between bursts of client writes. Thus the stored data is 
frequently held redundantly, rather than always 
guaranteed to be so. 

In this approach, small updates are not required to wait 
for the parity to be updated, thereby reducing the four 
I/Os in the critical path of the traditional small-update 
protocol to just one: write the new data. The benefit is 
that performance approaches that of an unprotected 
array. The disadvantage is a slightly increased risk of 
data loss from a disk failure, but we will show that this 
increase is small in practice, and also that it can be 
bounded at the cost of some performance. That is, 
AFRAID allows a smooth trade-off between increased 
reliability and increased performance.
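
The contrast between the traditional small-update protocol and the deferred parity update can be sketched as follows. This is a toy in-memory model, not the authors' design: the class names, the simulated I/O counter, and the use of a set of "dirty" stripes to remember which parity blocks are stale are all assumptions made for illustration.

```python
from typing import List, Set


def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


class Stripe:
    """One stripe: data blocks plus one parity block (hypothetical model)."""

    def __init__(self, data: List[bytes]) -> None:
        self.data = list(data)
        self.parity = self.full_parity()
        self.ios = 0  # simulated disk I/Os in the critical path

    def full_parity(self) -> bytes:
        parity = bytes(len(self.data[0]))
        for block in self.data:
            parity = xor(parity, block)
        return parity

    def raid5_write(self, i: int, new: bytes) -> None:
        """Traditional small update: four I/Os before the write completes."""
        old = self.data[i]
        self.ios += 1                                  # (1) read old data
        old_parity = self.parity
        self.ios += 1                                  # (2) read old parity
        new_parity = xor(xor(new, old), old_parity)    # (3) compute new parity
        self.data[i] = new
        self.ios += 1                                  # (4) write new data
        self.parity = new_parity
        self.ios += 1                                  # (5) write new parity


class DeferredParityArray:
    """Writes data immediately; parity is brought up to date when idle."""

    def __init__(self, stripes: List[Stripe]) -> None:
        self.stripes = stripes
        self.dirty: Set[int] = set()  # stripes whose parity is stale (assumed bookkeeping)

    def write(self, s: int, i: int, new: bytes) -> None:
        self.stripes[s].data[i] = new
        self.stripes[s].ios += 1      # the only I/O in the critical path
        self.dirty.add(s)

    def idle(self) -> None:
        """Called in a quiet period: recompute parity for stale stripes."""
        for s in sorted(self.dirty):
            self.stripes[s].parity = self.stripes[s].full_parity()
        self.dirty.clear()
```

In this model, a stripe listed in `dirty` is exactly the window of exposure: a disk failure before `idle()` runs could not be repaired from that stripe's parity.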



Like so many good ideas, ours was of course developed by 
back-determination from the acronym. 






Please Date Stamp and Return 



The Commissioner for Patents has received from Bromberg & Sunstein LLP the following re 



Inventor Miloushev, Vladimir 

Title: File Switch and Switched File 

Serial/Patent No.: 10/043,413 

Filing/Issue Date: January 10, 2002 



Provisional Application Cover Sheet 
Description- pages. 
Claims- pages 



3193/102 
2142 

Prieto, Beatriz 
October 30, 2007



Amendment Transmittal 



Request for Continued Examination



Petition for month extension 
Issue Fee Transmittal & Form PTOL-85b 
Payment of Maintenance Fee 
Assignment/Recordation Form Cover Sheet
Check in the amount of $
Completion of Filing Requirements
Transmittal of Formal Drawings
$405.00 charge to deposit account



